Food deserts are communities where residents have limited access to affordable and nutritious food, particularly fresh fruits and vegetables. These areas are often defined by their lack of supermarkets or grocery stores within convenient traveling distance and typically overlap with low-income or marginalized populations. According to the USDA, food deserts are identified as census tracts meeting both low-income and low-access criteria.
Our project was motivated by the critical public health and social equity implications of food deserts. Millions of Americans—especially those living in rural regions, minority communities, or impoverished neighbourhoods—struggle to access essential nutrition, which may contribute to broader systemic issues including chronic health problems, economic stagnation, and inter-generational poverty. Given the increased awareness of health disparities during the COVID-19 pandemic, we saw an opportunity to use data science tools to provide insights that could support more equitable food systems across the U.S.
This project aimed to use data-driven methods to explore the geographic patterns, contributing factors, and potential policy interventions related to food deserts in the United States. Through exploratory data analysis, statistical modelling, and predictive analytics, our team sought to understand the root causes and potential solutions to this pressing issue.
Food security is a fundamental determinant of health and well-being, and addressing food deserts requires detailed analysis. We were especially interested in exploring how machine learning models could not only reveal trends but also inform actionable policies. So, we set forth the following SMART questions:
Where are food deserts most geographically concentrated across the U.S., and how do these concentrations differ between urban and rural census tracts?
What demographics and socioeconomic factors are linked to food deserts?
How are food deserts related to health issues like obesity and diabetes?
Can we develop a predictive model that accurately identifies high-risk areas likely to be or become food deserts based on social and economic indicators?
What national policies could reduce food deserts based on our findings?
Prior research has shown that food deserts are more prevalent in areas with high poverty, low vehicle ownership, and limited access to transportation. Studies have documented the link between food access and diet-related health outcomes, although the strength of this relationship varies by region and demographic group. Some researchers argue that physical proximity to food is only part of the problem—affordability, cultural preferences, and education also play key roles.
For instance, Ver Ploeg et al. (2009) in a USDA report highlighted that while low-income communities often lack large grocery stores, they may still have small corner stores that offer limited or unhealthy food options. Other studies, such as those by Walker et al. (2010), have shown that the presence of food deserts correlates with increased prevalence of obesity and Type 2 diabetes, but access alone does not guarantee improved health outcomes.
Our project builds on these findings by integrating datasets across health, economic, and demographic dimensions to assess both the presence and the effects of food deserts.
We utilized and downloaded 2 public datasets directly from USDA:
Food Access Research Atlas: Provides tract-level indicators of food access based on income and distance to supermarkets.
Food Environment Atlas: Provides supplementary food environment factors—such as store/restaurant proximity, food prices, food and nutrition assistance programs, and community characteristics—interact to influence food choices and diet quality.
We chose LILATracts_1And10 as the proxy variable for Food Desert. This variable sufficiently captures the definition of a food desert as set forth by the USDA.
LILATracts_1And10 is a binary indicator variable (1 = true, 0 = false) that identifies census tracts that meet both of the following criteria:
Low-Income Tract (LI):
The tract’s poverty rate is ≥ 20%,
or the tract’s median family income is ≤ 80% of the state (or metropolitan area) median.
Low-Access Tract (LA):
At least 500 people or 33% of the population in the tract reside more than:
1 mile (urban) or
10 miles (rural) from the nearest supermarket, supercentre, or large grocery store
The Food Access Atlas had 147 features in total including LILATracts_1And10, and the Food Environment Atlas had a separate table for each cohort like stores, restaurants, access, etc. with a total of 281 features. All tables were joined using CountyFIPS codes to ensure geographic consistency and the combined dataset had 428 features in total. There was a total of around 74k observations in the aggregated data. Since the variable list was large, we needed to implement a robust feature selection process to filter the features for modelling.
The following steps were employed in sequence to clean the data:
Dropping columns based on fill-rate: Only variables with a fill rate of >=90% were retained.
Data imputation: Of the remaining columns, the missing values of numerical variables were imputed with the median of its respective column.
Data type conversion: The binary columns, including the target variable, were converted to factor type, which would be useful when used with the glm model.
Data standardization: The numeric columns were standardized using z-score transformation which is helpful for faster convergence during modelling.
Random sampling: Since there were 74k observations, we needed to subset the dataset to use only a 30% random sample (~22k rows) for model training purposes.
Once the data was adequately cleaned and sub-sampled, we performed the following steps to filter the features:
Iterative Univariate feature selection: We built a logistic regression model using one variable from the feature list as the independent (X) and FoodDesert as the dependent variable (y). Once all the models were built, only the features wherein the model p-value was statistically significant (<0.05) were selected.
Penalized Lasso regression: The second step was to perform a lasso penalization on the logit model using all the variables from the previous step. Lasso tends to make the coefficients of the weak variables to zero thereby eliminating them.
Iterative feature elimination with vif: The features output from the lasso model were then used to build logit model and the corresponding vif scores of the variables were calculated. The feature with the maximum vif score was removed and this modelling process was repeated until the max vif score was <5.
Iterative feature elimination by p-value: The final step involved filtering the features by the p-value of the coefficients. Since there were about 100 variables at this stage, we put a stringent threshold for the p-value (<0.01) to make sure only the strongest factors affecting FoodDesert were retained.
At the end of the feature selection process, there were a total of 27 variables which were strongly linked to FoodDesert.
library(tidyverse)
library(ggplot2)
library(dplyr)
library(readr)
library(ggthemes)
library(caret)
library(randomForest)
library(pROC)
library(car)
library(broom)
library(reshape2)
library(plotly)
library(patchwork)
library(tidyr)
library(corrplot)
library(glmnet)
library(randomForest)
raw_df <- read_csv("../dataset/FoodAccessResearchAtlasData2019.csv")
str((raw_df))
df_fixed <- raw_df %>%
mutate(across(where(is.character), ~ na_if(., "NULL")))
missing_pct <- colMeans(is.na(df_fixed))
df_clean <- df_fixed[, missing_pct <= 0.10]
str(df_clean)
According to the USDA definitions, a “food desert” is typically a low-income tract that also has low access to supermarkets based on established distance criteria. Here, we use the ‘LILATracts_1And10’ column (which applies a 1-mile threshold for urban and a 10-mile threshold for rural areas) as an indicator. We assume that a value of 1 in ‘LILATracts_1And10’ indicates that the tract qualifies as a food desert.
df$FoodDesert <- as.numeric(as.character(df$LILATracts_1And10))
state_urban_counts <- df %>%
group_by(State, Urban) %>%
summarise(FoodDesert = sum(FoodDesert, na.rm = TRUE), .groups = 'drop')
food_desert_pivot <- state_urban_counts %>%
pivot_wider(names_from = Urban, values_from = FoodDesert, values_fill = 0) %>%
rename(`Rural Food Desert Count` = `0`, `Urban Food Desert Count` = `1`) %>%
mutate(`Total Food Desert Count` = `Rural Food Desert Count` + `Urban Food Desert Count`) %>%
arrange(desc(`Total Food Desert Count`))
head(food_desert_pivot, 10)
state_totals <- df %>%
group_by(State) %>%
summarise(
TotalFoodDeserts = sum(FoodDesert, na.rm = TRUE)
) %>%
arrange(desc(TotalFoodDeserts))
state_totals_top10 <- head(state_totals, 10)
ggplot(state_totals_top10, aes(x = reorder(State, TotalFoodDeserts), y = TotalFoodDeserts)) +
geom_bar(stat = "identity", fill = "coral") +
coord_flip() +
labs(
title = "Top 10 States by Number of Food Deserts",
x = "State",
y = "Number of Food Deserts"
) +
theme_minimal(base_size = 14)
state_summary <- df %>%
group_by(State, Urban) %>%
summarise(
TotalFoodDeserts = sum(FoodDesert, na.rm = TRUE),
TotalTracts = n(),
FoodDesertPercentage = 100 * TotalFoodDeserts / TotalTracts,
.groups = 'drop'
)
urban_summary <- state_summary %>%
filter(Urban == 1) %>%
arrange(desc(FoodDesertPercentage))
rural_summary <- state_summary %>%
filter(Urban == 0) %>%
arrange(desc(FoodDesertPercentage))
urban_plot <- ggplot(urban_summary, aes(x = reorder(State, FoodDesertPercentage), y = FoodDesertPercentage)) +
geom_bar(stat = "identity", fill = "steelblue") +
coord_flip() +
labs(
title = "Urban Food Desert % by State",
x = "State",
y = "Urban Desert %"
) +
theme_minimal(base_size = 12)
rural_plot <- ggplot(rural_summary, aes(x = reorder(State, FoodDesertPercentage), y = FoodDesertPercentage)) +
geom_bar(stat = "identity", fill = "seagreen") +
coord_flip() +
labs(
title = "Rural Food Desert % by State",
x = "State",
y = "Rural Desert %"
) +
theme_minimal(base_size = 12)
urban_plot + rural_plot + plot_layout(ncol = 2)
We used tract-level data to identify hotspots of food insecurity. We found the following observations:
Overall, food deserts were concentrated mostly in the southern region of the US. This spatial distribution is likely influenced by factors such as:
Urban sprawl and zoning in cities
Lack of commercial incentives in rural markets
Poor infrastructure and limited transportation
foodatlas <- read_csv("../dataset/FoodAccessResearchAtlasData2019.csv", col_types = cols(CensusTract = col_character()))
socioeconomic <- read_csv("../dataset/FE_socioeconomic.csv")
insecurity <- read_csv("../dataset/FE_insecurity.csv")
health <- read_csv("../dataset/FE_health.csv")
stores <- read_csv("../dataset/FE_stores.csv")
restaurants <- read_csv("../dataset/FE_restaurants.csv")
taxes <- read_csv("../dataset/FE_taxes.csv")
local <- read_csv("../dataset/FE_local.csv")
access <- read_csv("../dataset/FE_access.csv")
state_data <- read_csv("../dataset/FE_supplemental_data_state.csv")
county_data <- read_csv("../dataset/FE_supplemental_data_county.csv")
clean_nulls <- function(df) {
df %>% mutate(across(where(is.character), ~ na_if(., "NULL")))
}
foodatlas <- clean_nulls(foodatlas)
socioeconomic <- clean_nulls(socioeconomic)
insecurity <- clean_nulls(insecurity)
health <- clean_nulls(health)
stores <- clean_nulls(stores)
restaurants <- clean_nulls(restaurants)
taxes <- clean_nulls(taxes)
local <- clean_nulls(local)
access <- clean_nulls(access)
state_data <- clean_nulls(state_data)
county_data <- clean_nulls(county_data)
foodatlas <- foodatlas %>%
mutate(CensusTract = str_pad(CensusTract, 11, pad = "0"),
CountyFIPS = substr(CensusTract, 1, 5))
socioeconomic <- socioeconomic %>% mutate(CountyFIPS = str_pad(FIPS, 5, pad = "0"))
insecurity <- insecurity %>% mutate(CountyFIPS = str_pad(FIPS, 5, pad = "0"))
health <- health %>% mutate(CountyFIPS = str_pad(FIPS, 5, pad = "0"))
stores <- stores %>% mutate(CountyFIPS = str_pad(FIPS, 5, pad = "0"))
restaurants <- restaurants %>% mutate(CountyFIPS = str_pad(FIPS, 5, pad = "0"))
taxes <- taxes %>% mutate(CountyFIPS = str_pad(FIPS, 5, pad = "0"))
local <- local %>% mutate(CountyFIPS = str_pad(FIPS, 5, pad = "0"))
access <- access %>% mutate(CountyFIPS = str_pad(FIPS, 5, pad = "0"))
county_data <- county_data %>% mutate(CountyFIPS = str_pad(FIPS, 5, pad = "0"))
# Merge everything into foodatlas
merged_df <- foodatlas %>%
left_join(socioeconomic, by = "CountyFIPS") %>%
left_join(insecurity, by = "CountyFIPS") %>%
left_join(health, by = "CountyFIPS") %>%
left_join(stores, by = "CountyFIPS") %>%
left_join(restaurants, by = "CountyFIPS") %>%
left_join(taxes, by = "CountyFIPS") %>%
left_join(local, by = "CountyFIPS") %>%
left_join(access, by = "CountyFIPS") %>%
left_join(county_data, by = "CountyFIPS")
merged_df$FoodDesert <- as.numeric(as.character(merged_df$LILATracts_1And10))
predictor_vars <- merged_df %>%
select(-FoodDesert) %>%
select(where(is.numeric))
# Drop columns with >10% missing
missing_pct <- colMeans(is.na(predictor_vars))
predictor_vars <- predictor_vars[, missing_pct <= 0.10]
# Impute remaining NA with medians
predictor_vars <- predictor_vars %>%
mutate(across(everything(), ~ ifelse(is.na(.), median(., na.rm = TRUE), .)))
is_binary <- function(x) {
unique_vals <- unique(na.omit(x))
length(unique_vals) == 2 && all(sort(unique_vals) %in% c(0, 1))
}
binary_cols <- names(predictor_vars)[sapply(predictor_vars, is_binary)]
continuous_cols <- setdiff(names(predictor_vars), binary_cols)
# Convert binary variables to factor
predictor_vars <- predictor_vars %>%
mutate(across(all_of(binary_cols), as.factor))
# Scale continuous variables
predictor_vars <- predictor_vars %>%
mutate(across(all_of(continuous_cols), scale))
# Prepare modeling data
model_data <- bind_cols(
FoodDesert = merged_df$FoodDesert,
predictor_vars
) %>% drop_na()
set.seed(97)
model_data_sampled <- model_data %>% sample_frac(0.3)
str(model_data_sampled)
predictor_vars <- model_data_sampled %>%
select(-c(FoodDesert,LILATracts_1And10,LILATracts_1And20,LILATracts_halfAnd10,`2010_Census_Population`)) %>%
select(where(is.numeric))
y <- model_data_sampled$FoodDesert
univariate_results <- lapply(names(predictor_vars), function(var) {
temp_formula <- as.formula(paste("FoodDesert ~", var))
model <- glm(temp_formula, data = model_data_sampled, family = binomial())
tidy(model) %>%
filter(term != "(Intercept)") %>%
mutate(variable = var)
})
univariate_df <- do.call(rbind, univariate_results)
selected_vars_univariate <- univariate_df %>%
filter(p.value < 0.05) %>%
pull(variable)
cat("Variables selected after univariate screening:", length(selected_vars_univariate), "\n")
print(selected_vars_univariate)
X_lasso <- model.matrix(
as.formula(paste("~", paste0("`", selected_vars_univariate, "`", collapse = " + "))),
data = model_data_sampled
)[, -1] # remove intercept
y_lasso <- model_data_sampled$FoodDesert
# Lasso logistic regression
set.seed(72)
lasso_fit <- cv.glmnet(X_lasso, y_lasso, family = "binomial", alpha = 1)
best_lambda <- lasso_fit$lambda.min
lasso_coefs <- coef(lasso_fit, s = best_lambda)
selected_lasso_vars <- rownames(lasso_coefs)[lasso_coefs[, 1] != 0]
selected_lasso_vars <- selected_lasso_vars[selected_lasso_vars != "(Intercept)"]
cat("Variables selected after Lasso:", length(selected_lasso_vars), "\n")
print(selected_lasso_vars)
remove_high_vif_iteratively <- function(data, target = "FoodDesert", threshold = 5) {
current_vars <- setdiff(names(data), target)
repeat {
temp_formula <- as.formula(paste(target, "~", paste0("`", current_vars, "`", collapse = " + ")))
temp_model <- glm(temp_formula, data = data, family = binomial())
vif_values <- vif(temp_model)
if (max(vif_values, na.rm = TRUE) < threshold) {
break
}
worst_var <- names(which.max(vif_values))
cat("Removing variable:", worst_var, "VIF =", round(max(vif_values), 2), "\n")
current_vars <- setdiff(current_vars, worst_var)
}
return(current_vars)
}
model_data_reduced <- model_data_sampled %>%
select(all_of(c(selected_lasso_vars, "FoodDesert"))) %>%
drop_na()
# Run VIF-based iterative filtering
final_selected_vars <- remove_high_vif_iteratively(model_data_reduced, target = "FoodDesert", threshold = 5)
cat("Final number of predictors:", length(final_selected_vars), "\n")
print(final_selected_vars)
remove_high_pvalue_iteratively <- function(data, target = "FoodDesert", threshold = 0.05) {
current_vars <- setdiff(names(data), target)
repeat {
temp_formula <- as.formula(paste(target, "~", paste0("`", current_vars, "`", collapse = " + ")))
temp_model <- glm(temp_formula, data = data, family = binomial())
model_summary <- tidy(temp_model)
# Remove intercept row
model_summary <- model_summary %>%
filter(term != "(Intercept)")
# Check max p-value
max_pval <- max(model_summary$p.value, na.rm = TRUE)
if (max_pval < threshold) {
break
}
worst_var <- model_summary %>%
filter(p.value == max_pval) %>%
pull(term) %>%
gsub("`", "", .) # remove backticks
cat("Removing variable:", worst_var, "p-value =", round(max_pval, 4), "\n")
current_vars <- setdiff(current_vars, worst_var)
}
return(current_vars)
}
model_data_reduced <- model_data_sampled %>%
select(all_of(c(final_selected_vars, "FoodDesert"))) %>%
drop_na()
final_significant_vars <- remove_high_pvalue_iteratively(model_data_reduced, target = "FoodDesert", threshold = 0.01)
cat("Final number of predictors after p-value filtering:", length(final_significant_vars), "\n")
print(final_significant_vars)
final_formula <- as.formula(paste("FoodDesert ~", paste0("`", final_significant_vars, "`", collapse = " + ")))
final_model <- glm(final_formula, data = model_data_reduced, family = binomial())
##
## Call:
## glm(formula = final_formula, family = binomial(), data = model_data_reduced)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.4968 0.0314 -79.56 < 2e-16 ***
## NUMGQTRS 0.1605 0.0196 8.17 3.2e-16 ***
## PovertyRate 0.4851 0.0289 16.79 < 2e-16 ***
## TractKids 0.2417 0.0379 6.38 1.8e-10 ***
## TractSeniors 0.2166 0.0308 7.03 2.1e-12 ***
## TractWhite -0.5274 0.0409 -12.91 < 2e-16 ***
## TractAsian -0.3717 0.0528 -7.04 1.9e-12 ***
## TractAIAN 0.0965 0.0296 3.26 0.00113 **
## TractSNAP 0.3322 0.0301 11.03 < 2e-16 ***
## PCT_NHNA10 -0.1189 0.0312 -3.81 0.00014 ***
## POVRATE15 -0.2068 0.0312 -6.63 3.4e-11 ***
## CH_FOODINSEC_14_17 -0.1470 0.0326 -4.51 6.4e-06 ***
## PCT_OBESE_ADULTS12 -0.0951 0.0270 -3.52 0.00043 ***
## GROCPTH11 -0.1915 0.0374 -5.12 3.1e-07 ***
## SUPERCPTH16 0.1005 0.0224 4.48 7.3e-06 ***
## CONVSPTH16 0.0886 0.0291 3.04 0.00234 **
## SPECSPTH11 -0.0944 0.0339 -2.78 0.00541 **
## FSRPTH11 0.0764 0.0292 2.62 0.00886 **
## PCT_LOCLFARM07 -0.1869 0.0522 -3.58 0.00035 ***
## PCT_LOCLSALE12 -0.0917 0.0338 -2.71 0.00674 **
## PC_DIRSALES07 0.0726 0.0222 3.27 0.00106 **
## FMRKT13 -0.1886 0.0441 -4.28 1.9e-05 ***
## PCT_FMRKT_SNAP18 0.0788 0.0274 2.88 0.00400 **
## PCT_FMRKT_CREDIT18 -0.0816 0.0264 -3.09 0.00199 **
## PCT_LACCESS_LOWI10 0.2188 0.0346 6.32 2.6e-10 ***
## PCH_LACCESS_HHNV_10_15 0.3866 0.1112 3.48 0.00051 ***
## PCT_LACCESS_HHNV10 0.1741 0.0313 5.57 2.6e-08 ***
## PCT_LACCESS_SNAP15 0.1822 0.0367 4.97 6.7e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 16498 on 21758 degrees of freedom
## Residual deviance: 13145 on 21731 degrees of freedom
## AIC: 13201
##
## Number of Fisher Scoring iterations: 6
## [1] "VIF values of the coefficients:"
## PC_DIRSALES07 NUMGQTRS TractAsian
## 1.16 1.20 1.20
## SUPERCPTH16 PCH_LACCESS_HHNV_10_15 CH_FOODINSEC_14_17
## 1.21 1.23 1.27
## PCT_LOCLSALE12 PCT_OBESE_ADULTS12 FMRKT13
## 1.30 1.40 1.52
## GROCPTH11 SPECSPTH11 PCT_LOCLFARM07
## 1.52 1.53 1.55
## FSRPTH11 PCT_FMRKT_SNAP18 PCT_FMRKT_CREDIT18
## 1.56 1.69 1.73
## CONVSPTH16 TractSeniors POVRATE15
## 1.81 1.86 1.90
## PCT_LACCESS_HHNV10 PovertyRate TractAIAN
## 1.94 2.03 2.18
## PCT_NHNA10 TractSNAP TractKids
## 2.26 2.34 2.77
## PCT_LACCESS_LOWI10 TractWhite PCT_LACCESS_SNAP15
## 2.85 3.09 3.16
final_vars_with_target <- c(final_significant_vars, "FoodDesert")
final_data <- model_data_reduced %>%
select(all_of(final_vars_with_target))
final_data_numeric <- final_data %>%
mutate(across(everything(), ~ as.numeric(as.character(.))))
cor_matrix_final <- cor(final_data_numeric, use = "complete.obs", method = "spearman")
cor_with_target <- data.frame(
Variable = rownames(cor_matrix_final),
Correlation = cor_matrix_final[, "FoodDesert"]
)
cor_with_target <- cor_with_target %>%
filter(Variable != "FoodDesert")
cor_with_target <- cor_with_target %>%
mutate(abs_correlation = abs(Correlation))
top10_features <- cor_with_target %>%
arrange(desc(abs_correlation)) %>%
slice(1:5)
top10_var_names <- top10_features$Variable
cor_matrix_top10 <- cor_matrix_final[c(top10_var_names,"FoodDesert"), c(top10_var_names,"FoodDesert")]
corrplot(
cor_matrix_top10,
method = "color",
type = "upper",
addCoef.col = "black",
tl.col = "black",
tl.srt = 45,
number.cex = 0.8,
tl.cex = 0.9,
mar = c(0,0,2,0)
)
boxplot_data <- merged_df %>%
mutate(FoodDesert = as.factor(FoodDesert)) %>%
select(c(top10_var_names,"FoodDesert"))
for (var in top10_var_names) {
p <- ggplot(boxplot_data, aes(x = FoodDesert, y = .data[[var]])) +
geom_boxplot(fill = "green") +
labs(
title = paste("Boxplot of", var, "by Food Desert Status"),
x = "Food Desert (0 = No, 1 = Yes)",
y = var
) +
theme_minimal(base_size = 14)
print(p)
}
It looks like PovertyRate, Vehicle access and SNAP status are the most prominent factors.
These indicators suggest that food insecurity stems from structural inequality, not just poor planning. Households may technically live within a mile of a grocery store, but without a car or reliable public transport, access remains a barrier.
The main drivers of food deserts were found to be low-income, limited vehicle availability and SNAP user share. Our analysis points to multiple intervention policies aimed at each of these drivers that can be introduced at a federal level as well as at the state level to reduce food deserts:
Low-income Tracts:
Expand the Earned Income Tax Credit: Increase the credit for very low-income households living in federally designated food-desert tracts.
Targeted Workforce Development Grants: Embed food-desert zone bonuses into employee training programs, so employers in these areas receive wage subsidies when they hire and upskill local residents.
Limited vehicle availability:
Incentivize Mobile Market Operators: Offer incentives to vendors to serve limited vehicle neighbourhoods.
Zero-Fare Public transport routes: Partner with state transportation services to waive off fares on designated public-transit routes connecting grocery retailers with food deserts.
SNAP enhancements:
Nationwide SNAP Online Ordering & Delivery: Mandate all major grocery chains to accept online SNAP purchases and subsidize delivery fees for EBT users in food desert counties.
Mobile EBT Kiosk Grants: Deploy portable EBT terminals throughout rural food deserts to broaden point-of-sale options.
These policies should be multi-tiered, combining federal, state, and local efforts and tailored to regional needs.
Our analysis confirms that food deserts are complex, multifaceted problems rooted in geography, infrastructure, and inequality. While physical distance to food sources matters, our findings emphasize that income, transportation, and economic vulnerability are stronger predictors of food access.
We found that food deserts are primarily concentrated in the southern region of the United States. Health impacts are not directly caused by food deserts alone, but by other supplementary factors intersecting socio-economic conditions. Thus, any solution must address the full ecosystem—access, affordability, education, and opportunity.
Our predictive models offer a foundation for targeted policy responses and future research. By identifying high-risk areas, we can improve resource allocation and advance equitable access to nutrition for all Americans.
Some limitations of our study include:
Health-impact analysis was narrow since the study was restricted to only 2 variables – diabetes and obesity. However, there are many more health indicators that can be used along with supplementary environmental factors. All these indicators can be utilized to holistically provide a clear view on the relationship between health and food access.
Not all relevant variables were sourced—future research should integrate factors like crime rates, school quality, and walkability from other government annotated datasets.
Our models assume static conditions; longitudinal or time-series data could better capture the dynamics of changing neighbourhoods.
Future directions include:
Incorporating clustering analysis to identify segments more prone to food deserts and time-series analysis to historically map the point of conversion from a non-food desert tract to a food desert tract.
Implementing more robust feature-engineering and modelling processes. This includes creating complex variables as well as using more sophisticated models with hyper-parameter tuning.
Collaborating with urban planners, public health departments, and non-profits for insights and real-world applications.
Ver Ploeg et al. (2009).
Access to affordable and nutritious food: Measuring and
understanding food deserts and their consequences: Report to
Congress.
U.S. Department of Agriculture, Economic Research Service.
https://www.ers.usda.gov/publications/pub-details/?pubid=42729
Walker, R. E., Keane, C. R., & Burke, J. G.
(2010).
Disparities and access to healthy food in the United States: A review of
food deserts literature. Health & Place, 16(5),
876–884.
https://doi.org/10.1016/j.healthplace.2010.04.013
Food Access Research Atlas (Food Deserts data)
U.S. Department of Agriculture, Economic Research Service. (2020).
Food Access Research Atlas data (2019) [Data set].
https://www.ers.usda.gov/data-products/food-access-research-atlas/
Food Environment Atlas (Socioeconomic, Health, Access, Store, Restaurant, Tax, and Local data)
U.S. Department of Agriculture, Economic Research Service. (2020). Food Environment Atlas [Data set]. https://www.ers.usda.gov/data-products/food-environment-atlas/